156 research outputs found

    Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?

    BACKGROUND: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets. RESULTS: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences. CONCLUSION: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
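
    The abstract above describes alignment masking only in outline: problematic columns and input sequences are excluded. As a rough illustration of that general idea, the sketch below drops columns and then sequences whose gap fraction exceeds a threshold. This is not the authors' masking procedure; the function name `mask_alignment`, the simple gap-fraction criterion, and both threshold values are assumptions chosen for illustration.

```python
def mask_alignment(seqs, max_col_gap=0.5, max_seq_gap=0.5, gap="-"):
    """Drop gappy columns, then gappy sequences.

    seqs maps a sequence name to its aligned string. The gap-fraction
    thresholds are illustrative assumptions, not values from the study.
    """
    names = list(seqs)
    if not names:
        return {}
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all sequences must have the same aligned length")

    # Keep a column only if its fraction of gap characters is small enough.
    n = len(names)
    keep_cols = [
        i for i in range(length)
        if sum(seqs[name][i] == gap for name in names) / n <= max_col_gap
    ]
    if not keep_cols:
        return {}

    # Rebuild each sequence from the retained columns.
    masked = {name: "".join(seqs[name][i] for i in keep_cols) for name in names}

    # Discard sequences that are still mostly gaps after column masking.
    return {
        name: s for name, s in masked.items()
        if s.count(gap) / len(s) <= max_seq_gap
    }


if __name__ == "__main__":
    toy = {"a": "ACGT--", "b": "ACGTAC", "c": "--GTAC", "d": "------"}
    print(mask_alignment(toy, max_col_gap=0.4))
```

    The toy input mimics the staggered gap pattern of partial gene sequences. A stricter column threshold keeps fewer columns, and the all-gap sequence is discarded, which mirrors the trade-off the abstract reports between accuracy and the number of sequences retained.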

    Data reuse and scholarly reward: understanding practice and building infrastructure

    Recently introduced funding agency policies seek to increase the availability of data from individual published studies for reuse by the research community at large. The success of such policies can be measured both by data input (“is useful data being made available?”) and research output (“are these data being reused by others?”). A key determinant of data input is the extent to which data producers receive adequate professional credit for making data available. One of us (HP) previously reported a large citation difference for published microarray studies with and without data available in a public repository. Analysis of a much larger sample, with more covariates, provides a more reliable estimate of this citation boost, as well as additional insights into patterns of reuse and how the availability of data affects publication impact. A more recent study tracking the reuse of 100 datasets from each of ten different primary data repositories reveals large variation in patterns of reuse and citation. Our findings (a) illuminate ways in which the reuses of archived data tend to differ in purpose from that of the original producers; (b) inform data archiving policy, such as how long data embargoes need to be in order to protect the proprietary interests of producers; and (c) allow us to answer the vexing question of what the return on investment is for data archiving. In conducting these studies, we have become aware of gaps in data citation practice and infrastructure that limit the extent to which researchers receive credit for their contributions. We describe early efforts to bake good data citation and usage tracking into cyberinfrastructure as part of DataONE, the Data Observation Network for Earth. Finally, we introduce total-impact, a tool that allows researchers to track the diverse impacts of all their research outputs, including data, and empowers them to be recognized for their scholarly work on their own terms.

    Phytome: a platform for plant comparative genomics

    Phytome is an online comparative genomics resource that can be applied to functional plant genomics, molecular breeding and evolutionary studies. It contains predicted protein sequences, protein family assignments, multiple sequence alignments, phylogenies and functional annotations for proteins from a large, phylogenetically diverse set of plant taxa. Phytome serves as a glue between disparate plant gene databases both by identifying the evolutionary relationships among orthologous and paralogous protein sequences from different species and by enabling cross-references between different versions of the same gene curated independently by different database groups. The web interface enables sophisticated queries on lineage-specific patterns of gene/protein family proliferation and loss. This rich dataset is serving as a platform for the unification of sequence-anchored comparative maps across taxonomic families of plants. The Phytome web interface can be accessed at the following URL: . Batch homology searches and bulk downloads are available upon free registration

    Beginning to track 1000 datasets from public repositories into the published literature

    Data sharing provides many potential benefits, although the amount of actual data reused is unknown. Here we track the reuse of data from three data repositories (NCBI's Gene Expression Omnibus, PANGAEA, and TreeBASE) by searching for dataset accession number or unique identifier in Google Scholar and using ISI Web of Science to find articles that cited the data collection article. We found that data reuse and data attribution patterns vary across repositories. Data reuse appears to correlate with the number of citations to the data collection article. This preliminary investigation has demonstrated the feasibility of this method for tracking data reuse.
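
    The tracking method in the abstract above relies on finding dataset accession numbers in the literature. The study did this through Google Scholar searches. As a loosely related sketch, the snippet below scans a piece of article text for Gene Expression Omnibus series accessions, which take the form `GSE` followed by digits. The function name `find_geo_accessions` and the choice to scan local text instead of querying a search engine are assumptions for illustration, and PANGAEA and TreeBASE identifiers are not covered.

```python
import re

# GEO series accessions look like "GSE" followed by one or more digits.
GEO_SERIES = re.compile(r"\bGSE\d+\b")


def find_geo_accessions(text):
    """Return the distinct GEO series accessions mentioned in text, sorted."""
    return sorted(set(GEO_SERIES.findall(text)))


if __name__ == "__main__":
    sample = "Expression data were obtained from GSE1234 and GSE99 (see GSE1234)."
    print(find_geo_accessions(sample))
```

    A match on an accession only shows that a dataset is mentioned. Deciding whether the mention reflects actual reuse would still need manual review.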

    The two faces of secondary contact on islands

    Hybridization is thought to have played an important role in shaping the evolutionary history of diverse island taxa. Here, we propose an ecological and evolutionary framework for understanding the causes and consequences of heterospecific mating on islands – with and without hybridization. There are a number of reasons why secondary contact is expected to be unusually frequent on islands and why heterospecific mating may be a frequent result of such secondary contact. An important contributor is the suite of species and community traits that are enriched by the colonization process itself. The consequences of heterospecific mating may depend, to a large degree, on whether one of the species is introduced. Due to generally weak intrinsic reproductive isolation between island endemics, secondary contact will frequently lead to hybrid establishment and interspecific gene flow. By contrast, due to relatively longer divergence times between endemic and introduced taxa, there will typically be strong postzygotic isolation between them, and recurrent mating within zones of secondary contact will often lead instead to local exclusion by reproductive interference. Since recent human activity is bringing many insular endemics into contact with introduced relatives, this latter outcome may be an underappreciated conservation threat

    Tracking the evolution of alternatively spliced exons within the Dscam family

    BACKGROUND: The Dscam gene in the fruit fly, Drosophila melanogaster, contains twenty-four exons, four of which are composed of tandem arrays that each undergo mutually exclusive alternative splicing (4, 6, 9 and 17), potentially generating 38,016 protein isoforms. This degree of transcript diversity has not been found in mammalian homologs of Dscam. We examined the molecular evolution of exons within this gene family to locate the point of divergence for this alternative splicing pattern. RESULTS: Using the fruit fly Dscam exons 4, 6, 9 and 17 as seed sequences, we iteratively searched sixteen genomes for homologs, and then performed phylogenetic analyses of the resulting sequences to examine their evolutionary history. We found homologs in the nematode, arthropod and vertebrate genomes, including homologs in several vertebrates where Dscam had not been previously annotated. Among these, only the arthropods contain homologs arranged in tandem arrays indicative of mutually exclusive splicing. We found no homologs to these exons within the Arabidopsis, yeast, tunicate or sea urchin genomes but homologs to several constitutive exons from fly Dscam were present within tunicate and sea urchin. Comparing the rate of turnover within the tandem arrays of the insect taxa (fruit fly, mosquito and honeybee), we found the variants within exons 4 and 17 are well conserved in number and spatial arrangement despite 248–283 million years of divergence. In contrast, the variants within exons 6 and 9 have undergone considerable turnover since these taxa diverged, as indicated by deeply branching taxon-specific lineages. CONCLUSION: Our results suggest that at least one Dscam exon array may be an ancient duplication that predates the divergence of deuterostomes from protostomes but that there is no evidence for the presence of arrays in the common ancestor of vertebrates. The different patterns of conservation and turnover among the Dscam exon arrays provide a striking example of how a gene can evolve in a modular fashion rather than as a single unit.

    Phenex: Ontological Annotation of Phenotypic Diversity

    Phenex is a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic variation using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Despite the centrality of the phenotype to so much of biology, traditions for communicating information about phenotypes are idiosyncratic to different disciplines. Phenotypes seem to elude standardized descriptions due to the variety of traits that compose them and the difficulty of capturing the complex forms and subtle differences among organisms that we can readily observe. Consequently, phenotypes are refractory to attempts at data integration that would allow computational analyses across studies and study systems. Phenex addresses this problem by allowing scientists to employ standard ontologies and syntax to link computable phenotype annotations to evolutionary character matrices, as well as to link taxa and specimens to ontological identifiers. Ontologies have become a foundational technology for establishing shared semantics, and, more generally, for capturing and computing with biological knowledge